Machine learning in the FinTech industry
Problem statement: The company is facing challenges in its loan procedure, specifically the issue of defaulters. It wants to transform the procedure using a machine learning model to improve decision-making and reduce defaults.
Solution: The company can use AI and automation to analyze customer data in real time and predict the likelihood that a customer will default on a loan. Machine learning models such as logistic regression, decision trees, random forests, and neural networks can support better loan-approval decisions. The data analytics team should ensure there is enough data to train the model accurately, choose the right features for predicting default, and select appropriate performance metrics for evaluating the model. The outlook for AI and automation in the loan procedure is promising: better decision-making, lower default risk, and improved customer service. This can in turn benefit the economy by improving the financial health of individuals and businesses, leading to increased investment and growth.
How can our company transform its loan procedure with AI and automation to prevent defaulters?
Overview of how we could implement this:
With the help of AI and automation, we can analyze customer data in real-time and predict the likelihood of a customer defaulting on their loan. We can use machine learning models such as logistic regression, decision trees, random forests, and neural networks to make better decisions regarding loan approval.
What parameters should we consider when building these machine learning models?
The data analytics team should consider several parameters when building these models. First, we need enough relevant data to train the model accurately. Second, we need to choose the right features for predicting the likelihood of default. Finally, we need to consider the performance metrics, such as precision, recall, and accuracy, used to evaluate the model.
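As a quick illustration of these metrics, the sketch below computes precision, recall, and accuracy directly from confusion-matrix counts; the counts themselves are illustrative.

```python
def summarize_confusion(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Illustrative counts: the model catches 85 of 100 defaulters but also
# flags 352 healthy loans.
p, r, a = summarize_confusion(tp=85, fp=352, fn=15, tn=2548)
print(round(p, 2), round(r, 2), round(a, 2))  # 0.19 0.85 0.88
```

Note how accuracy alone looks healthy while precision is poor; this is why several metrics must be tracked together on imbalanced data.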
What do you think the future of AI and automation in the loan procedure looks like?
I believe the future looks very promising for AI and automation in the loan procedure. As we continue to collect more data and improve our machine learning models, we'll be able to make better decisions regarding loan approval and reduce the risk of defaulters. Additionally, we can provide better customer service through automation and personalized loan servicing, which could lead to higher customer satisfaction and loyalty.
How do you think this will affect the market during a recession?
I think this will have a positive impact on the economy. By reducing the risk of defaulters, the financial health of individuals and businesses will improve, which could lead to increased investment and economic growth. It's a win-win situation for both our company and the economy.
MLOps
The MLOps pipeline involves data collection and preprocessing, feature engineering, model training, model deployment, and monitoring. It can be automated with tools such as Docker, Azure ML, and Kubernetes, and deployed on cloud infrastructure such as AWS, GCP, or Azure.
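As a rough sketch of the idea, the pipeline stages can be modeled as composable Python callables. The stage names and record layout below are hypothetical; in production each stage would typically run as a containerized step under an orchestrator such as Azure ML or Kubernetes.

```python
# Hypothetical stage names and record layout, for illustration only.
def collect(rows):
    # Data collection / preprocessing: drop records with missing fields.
    return [r for r in rows if None not in r.values()]

def engineer(rows):
    # Feature engineering: derive a saving-rate feature.
    for r in rows:
        r["saving_rate"] = r["balance"] / r["salary"]
    return rows

def run_pipeline(rows, stages):
    # Chain the stages in order, feeding each one's output to the next.
    for stage in stages:
        rows = stage(rows)
    return rows

raw = [{"balance": 8754.36, "salary": 532339.56},
       {"balance": None, "salary": 145273.56}]
clean = run_pipeline(raw, [collect, engineer])
print(len(clean))  # 1 -- the record with a missing balance was dropped
```

Training, deployment, and monitoring would be added as further stages in the same shape, which keeps each step independently testable and containerizable.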
About Dataset
A simulated financial dataset has been generated using genuine information from a financial organization. It has been altered to remove any identifying characteristics, and the figures have been modified to prevent any linkage to the original source (the financial institution). The purpose of this dataset is to give trainees a simple financial dataset for practicing financial analytics in a POC.
Highlights of the Loan Default Classification:
- Classification, Imbalanced Data, and PR Curve
Contents
EDA
Check NaN values
Data overview
Feature engineering
Data distribution
Modeling
Train test split
Standardization
Upsampling by SMOTE
Logistic regression
Support vector machine
Random forest
LightGBM
XGBoost
Model assessment
ROC curve
PR curve
import pandas as pd
import numpy as np
import plotly.express as px
from matplotlib import pyplot as plt
import os
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_curve, auc, roc_curve
import seaborn as sns
dataset = pd.read_csv('data/FinTech_Dataset.csv',index_col=0)
print(dataset.head())
       Employed  Bank Balance  Annual Salary  Defaulted?
Index
1             1       8754.36      532339.56           0
2             0       9806.16      145273.56           0
3             1      12882.60      381205.68           0
4             1       6351.00      428453.88           0
5             1       9427.92      461562.00           0
dataset.shape
(10000, 4)
dataset.isna().sum()
# As there are no NaN values in this data, missing values are not a major concern.
Employed         0
Bank Balance     0
Annual Salary    0
Defaulted?       0
dtype: int64
dataset.describe()
|  | Employed | Bank Balance | Annual Salary | Defaulted? |
|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 0.705600 | 10024.498524 | 402203.782224 | 0.033300 |
| std | 0.455795 | 5804.579486 | 160039.674988 | 0.179428 |
| min | 0.000000 | 0.000000 | 9263.640000 | 0.000000 |
| 25% | 0.000000 | 5780.790000 | 256085.520000 | 0.000000 |
| 50% | 1.000000 | 9883.620000 | 414631.740000 | 0.000000 |
| 75% | 1.000000 | 13995.660000 | 525692.760000 | 0.000000 |
| max | 1.000000 | 31851.840000 | 882650.760000 | 1.000000 |
The "Employed" column is categorical, while the "Bank Balance" and "Annual Salary" columns are numerical. Our objective is a binary classification task on the target column "Defaulted?".
dataset.insert(3, 'Saving Rate', dataset['Bank Balance'] / dataset['Annual Salary'])
print(dataset.head())
       Employed  Bank Balance  Annual Salary  Saving Rate  Defaulted?
Index
1             1       8754.36      532339.56     0.016445           0
2             0       9806.16      145273.56     0.067501           0
3             1      12882.60      381205.68     0.033794           0
4             1       6351.00      428453.88     0.014823           0
5             1       9427.92      461562.00     0.020426           0
We generate a new feature named "Saving Rate" based on the "Bank Balance" and "Annual Salary" data. The Saving Rate feature provides insight into the spending habits of each user. Generally, a user with a higher Saving Rate is considered less likely to default. We will investigate the relationship between these variables in greater detail later on.
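The derivation can be reproduced in isolation on a tiny hand-made frame; the guard against a zero salary is an extra precaution, not something this dataset strictly needs, since its minimum Annual Salary is well above zero.

```python
import pandas as pd

# Two rows copied from the head() output above.
df = pd.DataFrame({"Bank Balance": [8754.36, 9806.16],
                   "Annual Salary": [532339.56, 145273.56]})
# Divide balance by salary; mask any zero-salary rows to NaN instead of
# letting the ratio blow up (cheap insurance for real-world data).
df["Saving Rate"] = (df["Bank Balance"] / df["Annual Salary"]).where(
    df["Annual Salary"] > 0)
print(df["Saving Rate"].round(6).tolist())  # [0.016445, 0.067501]
```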
Default distribution
tbl = dataset['Defaulted?'].value_counts().reset_index()
tbl.columns = ['Status', 'Number']
tbl['Status'] = tbl['Status'].map({1 :'Defaulted', 0 :'Not defaulted'})
print(tbl)
          Status  Number
0  Not defaulted    9667
1      Defaulted     333
fig = px.pie(tbl,
values='Number',
names = 'Status',
title='Default Status')
fig.show()
Loan defaults affect only about 3% of customers, resulting in an imbalanced classification problem.
Employed distribution
tbl = dataset['Employed'].value_counts().reset_index()
tbl.columns = ['Status', 'Number']
tbl['Status'] = tbl['Status'].map({1 :'Employed', 0 :'Unemployed'})
tbl
|  | Status | Number |
|---|---|---|
| 0 | Employed | 7056 |
| 1 | Unemployed | 2944 |
fig = px.pie(tbl,
values='Number',
names = 'Status',
title='Employed Status')
fig.show()
tbl = dataset.copy()
tbl['Employed'] = tbl['Employed'].replace({1 :'Employed', 0 :'Unemployed'})
tbl['Defaulted?'] = tbl['Defaulted?'].replace({1 :'Defaulted', 0 :'Not defaulted'})
fig = px.sunburst(tbl,
path=['Employed','Defaulted?'],
title='Relationship between Employment and Loan Default')
fig.show()
Contingency table
tbl = pd.crosstab(dataset['Employed'],dataset['Defaulted?'])
print(tbl)
Defaulted?     0    1
Employed
0           2817  127
1           6850  206
Pearson’s χ2 test for independence
chi2, p, dof, ex = chi2_contingency(tbl)
print("p-value:", p)
p-value: 0.0004997256756210478
Conclusion: Since the p-value (≈0.0005) is well below the 0.05 significance level, we reject the hypothesis that employment status and default are independent. Employment status can therefore be used to help predict default.
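For reference, the test can be reproduced in isolation from the contingency table printed above:

```python
from scipy.stats import chi2_contingency

# The contingency table printed above, re-entered by hand.
table = [[2817, 127],
         [6850, 206]]
chi2, p, dof, expected = chi2_contingency(table)
print(dof, p < 0.05)  # 1 True -- reject independence at the 5% level
```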
Bank Balance distribution
fig = px.histogram(dataset, x="Bank Balance", color='Defaulted?',
marginal="box", # or violin, rug
hover_data=dataset.columns)
fig.show()
We find that this is an asymmetric distribution, with many people having zero bank balance.
Let's check this further by counting the number of accounts holding at most 10 dollars.
(dataset['Bank Balance'] <= 10).sum()
501
Conclusion:
Approximately 500 individuals have hardly saved any money in their bank accounts, which could pose a risk for loan defaults. Surprisingly, those who have defaulted on their loans tend to have a higher balance in their bank accounts. This observation may seem counterintuitive and suggests the presence of confounding factors. It is possible that individuals with a higher bank balance may have easier access to loans, leading to a higher number of defaults.
Annual Salary distribution
fig = px.histogram(dataset, x="Annual Salary",
color="Defaulted?",
marginal="box", # or violin, rug
hover_data=dataset.columns)
fig.show()
Conclusion:
Saving Rate distribution
fig = px.histogram(dataset, x="Saving Rate",
color='Defaulted?',
marginal="box", # or violin, rug
hover_data=dataset.columns)
fig.show()
Conclusion:
The distribution of saving rate is similar to that of bank balance, but with a few extreme outliers. This suggests that people's saving habits can vary significantly. Some individuals may earn a high income but spend more than they save, while others with relatively low salaries may have a significant amount of savings.
Train test split
RAND_SEED = 123
X_train, X_test, y_train, y_test = train_test_split(dataset.iloc[:,:-1], dataset.iloc[:,-1], test_size=0.3, stratify=dataset.iloc[:,-1], random_state=RAND_SEED)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(7000, 4) (3000, 4) (7000,) (3000,)
Standardization
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Upsampling by SMOTE
During the Exploratory Data Analysis (EDA) phase, it was observed that defaulted cases constituted only 3% of the samples. This highly imbalanced dataset could pose a challenge for classification models that aim to minimize the cost function. To address this issue, the SMOTE upsampling method was introduced to rebalance the dataset.
X_train.shape, y_train.shape
((7000, 4), (7000,))
y_train.value_counts()
0    6767
1     233
Name: Defaulted?, dtype: int64
#pip install --upgrade imbalanced-learn
sm = SMOTE(random_state=RAND_SEED)
X_train, y_train = sm.fit_resample(X_train, y_train)
X_train.shape, y_train.shape
((13534, 4), (13534,))
y_train.value_counts()
0    6767
1    6767
Name: Defaulted?, dtype: int64
The models we will examine are Logistic Regression, Support Vector Machine, Random Forest, LightGBM, and XGBoost. Our primary optimization metric is the recall rate for the defaulted class: in a bank loan-default problem, falsely rejecting a loan only forfeits potential interest, whereas approving a loan that later defaults loses the entire principal.
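This cost asymmetry can be made concrete with a toy calculation; the principal and interest-margin figures below are illustrative assumptions, not values taken from the dataset.

```python
# Toy cost model for the asymmetry between false positives and negatives.
# All monetary figures are illustrative assumptions.
PRINCIPAL = 10_000           # lost entirely when an approved loan defaults
INTEREST = 0.08 * PRINCIPAL  # forfeited when a good loan is rejected

def expected_cost(false_positives, false_negatives):
    return false_positives * INTEREST + false_negatives * PRINCIPAL

# A high-recall model with many false alarms can still be far cheaper
# than a high-precision model that misses more defaulters.
high_recall = expected_cost(false_positives=352, false_negatives=15)
high_precision = expected_cost(false_positives=20, false_negatives=60)
print(high_recall < high_precision)  # True (431600.0 vs 616000.0)
```

Under these assumptions, missing one defaulter costs as much as rejecting over a dozen good loans, which is why recall on the defaulted class is prioritized.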
Logistic regression
clf = LogisticRegression(solver='saga',random_state=RAND_SEED).fit(X_train, y_train)
y_pred = clf.predict(X_test)
Cross validation
cross_val_score(clf, X_train, y_train, scoring='recall' ,cv=5, )
array([0.91574279, 0.90613452, 0.9084195 , 0.90915805, 0.90909091])
First prediction result
print(confusion_matrix(y_test,y_pred))
[[2548  352]
 [  15   85]]
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.99 0.88 0.93 2900
1 0.19 0.85 0.32 100
accuracy 0.88 3000
macro avg 0.59 0.86 0.62 3000
weighted avg 0.97 0.88 0.91 3000
Hyperparameter tuning
distributions = dict(C=np.linspace(2, 1000, 100), penalty=['l2', 'l1'])
clf = RandomizedSearchCV(LogisticRegression(solver='saga', random_state=RAND_SEED),
                         distributions,
                         scoring='recall',
                         n_iter=100,
                         n_jobs=-1,
                         random_state=RAND_SEED)
clf_logistic = clf.fit(X_train, y_train)
clf_logistic.best_params_
{'penalty': 'l2', 'C': 254.02020202020202}
distributions = dict(C=[254.02020202020202], penalty=['l2'])
clf = RandomizedSearchCV(LogisticRegression(solver='saga',random_state=RAND_SEED),
distributions,
scoring='recall',
n_iter=100,
n_jobs = -1,
random_state=RAND_SEED)
clf_logistic = clf.fit(X_train, y_train)
clf_logistic.best_params_
C:\Users\Saurav\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:285: UserWarning: The total space of parameters 1 is smaller than n_iter=100. Running 1 iterations. For exhaustive searches, use GridSearchCV.
{'penalty': 'l2', 'C': 254.02020202020202}
y_pred_logistic = clf_logistic.predict(X_test)
Tuned prediction result
print(confusion_matrix(y_test,y_pred_logistic))
[[2547  353]
 [  15   85]]
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_logistic)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x21d144feb20>
print(classification_report(y_test,y_pred_logistic))
precision recall f1-score support
0 0.99 0.88 0.93 2900
1 0.19 0.85 0.32 100
accuracy 0.88 3000
macro avg 0.59 0.86 0.62 3000
weighted avg 0.97 0.88 0.91 3000
Support vector machine
clf = SVC(probability=True)
clf.fit(X_train, y_train)
SVC(probability=True)
y_pred = clf.predict(X_test)
Cross validation
cross_val_score(clf, X_train, y_train, scoring='recall' ,cv=5, )
array([0.94604582, 0.94826312, 0.94017725, 0.94239291, 0.94604582])
First prediction result
print(confusion_matrix(y_test,y_pred))
[[2427  473]
 [  11   89]]
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 1.00 0.84 0.91 2900
1 0.16 0.89 0.27 100
accuracy 0.84 3000
macro avg 0.58 0.86 0.59 3000
weighted avg 0.97 0.84 0.89 3000
Hyperparameter tuning
# Full search space explored in the original tuning run:
distributions = dict(C=np.logspace(0, 4, 50),
                     degree=np.linspace(1, 10, 1),
                     class_weight=[None, 'balanced'],
                     )
distributions = dict(C=[494.1713361323833],
degree = [1.0],
class_weight = [None],
)
# For training speed the iteration is set to 1.
# Given more time we can of course train more iters.
clf = RandomizedSearchCV(SVC(probability=True, cache_size = 1024*25),
distributions,
scoring='recall',
n_iter=1,
n_jobs = 1,
random_state=RAND_SEED)
clf_SVC = clf.fit(X_train, y_train)
clf_SVC.best_params_
{'degree': 1.0, 'class_weight': None, 'C': 494.1713361323833}
y_pred_SVC = clf_SVC.predict(X_test)
Tuned prediction result
print(confusion_matrix(y_test,y_pred_SVC))
[[2437  463]
 [  13   87]]
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_SVC)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x21d1480fca0>
print(classification_report(y_test,y_pred_SVC))
precision recall f1-score support
0 0.99 0.84 0.91 2900
1 0.16 0.87 0.27 100
accuracy 0.84 3000
macro avg 0.58 0.86 0.59 3000
weighted avg 0.97 0.84 0.89 3000
Random forest
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
RandomForestClassifier()
y_pred = clf.predict(X_test)
Cross validation
cross_val_score(clf, X_train, y_train, scoring='recall' ,cv=5, )
array([0.95417591, 0.96008869, 0.95273264, 0.95273264, 0.95934959])
First prediction result
print(confusion_matrix(y_test,y_pred))
[[2671  229]
 [  31   69]]
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.99 0.92 0.95 2900
1 0.23 0.69 0.35 100
accuracy 0.91 3000
macro avg 0.61 0.81 0.65 3000
weighted avg 0.96 0.91 0.93 3000
Hyperparameter tuning
distributions = dict(n_estimators=np.arange(10, 500, 10),
                     criterion=['gini', 'entropy'],
                     max_depth=range(1, 20),
                     min_samples_split=range(2, 20),
                     min_samples_leaf=range(3, 50),
                     bootstrap=[True, False],
                     class_weight=['balanced', 'balanced_subsample'])
clf = RandomizedSearchCV(RandomForestClassifier(),
                         distributions,
                         scoring='recall',
                         n_iter=20,
                         n_jobs=4,
                         random_state=RAND_SEED)
clf_random_forest = clf.fit(X_train, y_train)
clf_random_forest.best_params_
{'n_estimators': 490, 'min_samples_split': 14, 'min_samples_leaf': 5, 'max_depth': 8, 'criterion': 'gini', 'class_weight': 'balanced_subsample', 'bootstrap': False}
distributions = dict(n_estimators=[490],
criterion=['gini'],
max_depth = [8],
min_samples_split = [14],
min_samples_leaf = [5],
bootstrap = [False],
class_weight = ['balanced_subsample']
)
clf = RandomizedSearchCV(RandomForestClassifier(),
distributions,
scoring='recall',
n_iter=20,
n_jobs = 4,
random_state=RAND_SEED)
clf_random_forest = clf.fit(X_train, y_train)
clf_random_forest.best_params_
C:\Users\Saurav\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:285: UserWarning: The total space of parameters 1 is smaller than n_iter=20. Running 1 iterations. For exhaustive searches, use GridSearchCV.
{'n_estimators': 490,
'min_samples_split': 14,
'min_samples_leaf': 5,
'max_depth': 8,
'criterion': 'gini',
'class_weight': 'balanced_subsample',
'bootstrap': False}
y_pred_random_forest = clf_random_forest.predict(X_test)
Tuned prediction result
print(confusion_matrix(y_test,y_pred_random_forest))
[[2547  353]
 [  18   82]]
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_random_forest)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x21d148ac4c0>
print(classification_report(y_test,y_pred_random_forest))
precision recall f1-score support
0 0.99 0.88 0.93 2900
1 0.19 0.82 0.31 100
accuracy 0.88 3000
macro avg 0.59 0.85 0.62 3000
weighted avg 0.97 0.88 0.91 3000
LightGBM
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)
LGBMClassifier()
y_pred = clf.predict(X_test)
Cross validation
cross_val_score(clf, X_train, y_train, scoring='recall' ,cv=5, )
array([0.95934959, 0.9563932 , 0.95420975, 0.95125554, 0.96452328])
First prediction result
print(confusion_matrix(y_test,y_pred))
[[2616  284]
 [  33   67]]
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.99 0.90 0.94 2900
1 0.19 0.67 0.30 100
accuracy 0.89 3000
macro avg 0.59 0.79 0.62 3000
weighted avg 0.96 0.89 0.92 3000
Hyperparameter tuning
distributions = {
    'learning_rate': np.logspace(-5, 2, 50),
    'num_leaves': np.arange(10, 100, 10),
    'max_depth': np.arange(3, 13, 1),
    'colsample_bytree': np.linspace(0.1, 1, 10),
    'min_split_gain': np.linspace(0.01, 0.1, 10),
}
clf = RandomizedSearchCV(lgb.LGBMClassifier(),
                         distributions,
                         scoring='recall',
                         n_iter=100,
                         n_jobs=4,
                         random_state=RAND_SEED)
clf_lgb = clf.fit(X_train, y_train)
clf_lgb.best_params_
{'num_leaves': 60, 'min_split_gain': 0.030000000000000006, 'max_depth': 8, 'learning_rate': 0.07196856730011514, 'colsample_bytree': 0.7000000000000001}
distributions = {
'learning_rate': [0.07196856730011514],
'num_leaves': [60],
'max_depth' : [8],
'colsample_bytree' : [0.7000000000000001],
'min_split_gain' : [0.030000000000000006],
}
clf = RandomizedSearchCV(lgb.LGBMClassifier(),
distributions,
scoring='recall',
n_iter=100,
n_jobs = 4,
random_state=RAND_SEED)
clf_lgb = clf.fit(X_train, y_train)
clf_lgb.best_params_
C:\Users\Saurav\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:285: UserWarning: The total space of parameters 1 is smaller than n_iter=100. Running 1 iterations. For exhaustive searches, use GridSearchCV.
{'num_leaves': 60,
'min_split_gain': 0.030000000000000006,
'max_depth': 8,
'learning_rate': 0.07196856730011514,
'colsample_bytree': 0.7000000000000001}
y_pred_lgb = clf_lgb.predict(X_test)
Tuned prediction result
print(confusion_matrix(y_test,y_pred_lgb))
[[2595  305]
 [  27   73]]
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_lgb)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x21d145d0040>
print(classification_report(y_test,y_pred_lgb))
precision recall f1-score support
0 0.99 0.89 0.94 2900
1 0.19 0.73 0.31 100
accuracy 0.89 3000
macro avg 0.59 0.81 0.62 3000
weighted avg 0.96 0.89 0.92 3000
XGBoost
clf = XGBClassifier()
clf.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None, ...)
y_pred = clf.predict(X_test)
Cross validation
cross_val_score(clf, X_train, y_train, scoring='recall' ,cv=5, )
array([0.95934959, 0.9578714 , 0.95642541, 0.95199409, 0.95934959])
First prediction result
print(confusion_matrix(y_test,y_pred))
[[2619  281]
 [  31   69]]
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.99 0.90 0.94 2900
1 0.20 0.69 0.31 100
accuracy 0.90 3000
macro avg 0.59 0.80 0.63 3000
weighted avg 0.96 0.90 0.92 3000
Hyperparameter tuning
distributions = {
    'n_estimators': np.arange(100, 1000, 100),
    'max_depth': np.arange(2, 10, 1),
    'learning_rate': np.logspace(-4, 1, 50),
    'subsample': np.linspace(0.1, 1, 10),
    'colsample_bytree': np.linspace(0.1, 1, 10),
}
clf = RandomizedSearchCV(XGBClassifier(),
                         distributions,
                         scoring='recall',
                         n_iter=10,
                         n_jobs=4,
                         random_state=RAND_SEED)
clf_xgb = clf.fit(X_train, y_train)
clf_xgb.best_params_
{'subsample': 0.9, 'n_estimators': 600, 'max_depth': 8, 'learning_rate': 0.008685113737513529, 'colsample_bytree': 0.6}
distributions = { 'n_estimators': [600],
'max_depth':[8],
'learning_rate':[0.008685113737513529],
'subsample':[0.9],
'colsample_bytree':[0.6], }
clf = RandomizedSearchCV(XGBClassifier(),
distributions,
scoring='recall',
n_iter=10,
n_jobs = 4,
random_state=RAND_SEED)
clf_xgb = clf.fit(X_train, y_train)
clf_xgb.best_params_
C:\Users\Saurav\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:285: UserWarning: The total space of parameters 1 is smaller than n_iter=10. Running 1 iterations. For exhaustive searches, use GridSearchCV.
{'subsample': 0.9,
'n_estimators': 600,
'max_depth': 8,
'learning_rate': 0.008685113737513529,
'colsample_bytree': 0.6}
y_pred_xgb = clf_xgb.predict(X_test)
Tuned prediction result
print(confusion_matrix(y_test,y_pred_xgb))
[[2582  318]
 [  23   77]]
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_xgb)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x21d1494cdf0>
print(classification_report(y_test,y_pred_xgb))
precision recall f1-score support
0 0.99 0.89 0.94 2900
1 0.19 0.77 0.31 100
accuracy 0.89 3000
macro avg 0.59 0.83 0.62 3000
weighted avg 0.96 0.89 0.92 3000
Model assessment
ROC curve
sns.set()
model_names = ['LogisticRegression','SVM', 'RandomForest','LightGBM','XGBoost']
models = [clf_logistic, clf_SVC, clf_random_forest, clf_lgb, clf_xgb]
plt.figure(figsize=(8, 6))
for name, model in zip(model_names, models):
prob = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, prob)
model_auc = round(auc(fpr, tpr), 4)
plt.plot(fpr,tpr,label="{}, AUC={}".format(name, model_auc))
random_classifier=np.linspace(0.0, 1.0, 100)
plt.plot(random_classifier, random_classifier, 'r--')
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.legend()
plt.show()
Given the imbalanced nature of our dataset, our emphasis is on the precision-recall curve. Based on the test-set results, the logistic regression model performed well.
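As a minimal sketch of how the precision-recall curve and its AUC can be computed, the cell below runs on a small synthetic imbalanced dataset so it is self-contained; in the notebook itself one would loop over the fitted models exactly as the ROC cell does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan dataset (~3% positives).
X, y = make_classification(n_samples=2000, weights=[0.97],
                           flip_y=0.01, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=123)

# Score the positive class, then trace precision against recall.
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, prob)
pr_auc = auc(recall, precision)  # plt.plot(recall, precision) would mirror the ROC cell
print(0.0 <= pr_auc <= 1.0)  # True
```

Unlike the ROC baseline of 0.5, the PR baseline for a random classifier equals the positive-class prevalence, which is why even a modest PR AUC can be meaningful on 3%-positive data.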
The purpose of this notebook is to work with an imbalanced loan-default dataset using multiple ML models. Our findings show that the Support Vector Machine achieved the highest recall on the defaulted class (89% on the test set before tuning), while the Logistic Regression model surpassed all others with the top AUC of 0.5238 on the precision-recall curve. With additional features and further feature engineering, there is potential to improve these results.